from IPython.display import display, HTML
display(HTML("""
<style>
/* JupyterLab: center notebook and leave margins */
.jp-NotebookPanel-notebook {
width: 85% !important;
margin-left: auto !important;
margin-right: auto !important;
max-width: 1100px !important; /* optional cap */
}
/* Make output area match nicely */
.jp-OutputArea {
max-width: 100% !important;
}
</style>
"""))
# Render figures as high-resolution SVG.
%config InlineBackend.figure_format = 'svg'
0. Preparing Your Environment¶
0.1 Install Anaconda and setup environment¶
- Step 1: Install Anaconda. Download and install Anaconda from: https://www.anaconda.com/download.
Step 2: Set up git. Make sure you have git and ssh on your system (macOS usually ships with both). After this step, you are ready to download our course github project. On Ubuntu/Linux, install git and ssh via
sudo apt update
sudo apt install -y git openssh-client
Then, generate an SSH key (ed25519):
ssh-keygen -t ed25519 -C "your_email@example.com"
Press Enter for the default location, and optionally set a passphrase. This creates:
- ~/.ssh/id_ed25519 (private key)
- ~/.ssh/id_ed25519.pub (public key)
Then, you can start ssh-agent and add the key:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
Add the public key to GitHub, that is, copy the generated key:
cat ~/.ssh/id_ed25519.pub
It will look something like:
ssh-ed25519 XXXXXXX...XXXXXXX your-email@xxx.com
Copy this line, and then in GitHub: Settings → SSH and GPG keys → New SSH key → paste. To test the SSH connection, run
ssh -T git@github.com
If it says something like "Hi USERNAME! …", you're good. After these steps, you are ready to clone our course repository via
git clone git@github.com:baojian/llm-26.git
You are now ready to create our course env.
Step 3: Create and activate the course environment. Make sure you have conda installed.
Option 1: If your machine has a GPU (Linux + NVIDIA GPU), please create your env via
cd llm-26/lecture-01/  # make sure you are in our course folder
conda env create -f environment-gpu.yml
Activate it with:
conda activate llm-26-gpu
Note: environment-gpu.yml uses pytorch-cuda=12.1. If your GPU driver/CUDA setup does not support CUDA 12.1, change this version accordingly (e.g., 11.8).
Option 2: If you use macOS / Windows (WSL) / CPU-only Linux, please create your env via
conda env create -f environment.yml
Activate it with:
conda activate llm-26-cpu
Note: if you get errors when installing en_core_web_sm or zh_core_web_sm, please remove these two from the yml file and install them separately in your env via
conda activate llm-26-cpu  # or conda activate llm-26-gpu
python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
After the above steps, your course environment is ready.
Step 4: Open your jupyter notebook.
All your code will run in Jupyter notebooks; you can start JupyterLab via
cd llm-26  # make sure you are in our course folder
conda activate llm-26-cpu  # or conda activate llm-26-gpu
jupyter lab
For students using Windows WSL (Ubuntu 22.04): even though Jupyter runs inside WSL (Ubuntu), you can open it directly in the Windows browser via port forwarding (usually automatic). In WSL Ubuntu, start Jupyter like this:
jupyter lab --no-browser --ip=127.0.0.1 --port=8888
It will print a URL like:
http://127.0.0.1:8888/lab?token=...
Now on Windows, open your browser and go to:
http://localhost:8888/lab
Paste the token if it asks. On WSL2, Windows automatically forwards localhost:8888 to the WSL instance in most setups. If localhost:8888 doesn't work, run hostname -I in WSL; suppose it prints something like 172.27.123.45 ..., then open in Windows:
http://172.27.123.45:8888/lab
0.2 Checking your device¶
- After installing conda and creating your env, you can check whether these packages are installed correctly. From a terminal (with the env activated):
python -c "import torch; print('torch', torch.__version__); print('cuda?', torch.cuda.is_available()); print('mps?', getattr(torch.backends,'mps',None) is not None and torch.backends.mps.is_available()); print('cuda device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu'); print('using:', 'cuda' if torch.cuda.is_available() else ('mps' if (getattr(torch.backends,'mps',None) is not None and torch.backends.mps.is_available()) else 'cpu'))"
The same check from inside a notebook cell:
!python -c "import torch; print('torch', torch.__version__); print('cuda?', torch.cuda.is_available()); print('mps?', getattr(torch.backends,'mps',None) is not None and torch.backends.mps.is_available()); print('cuda device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no gpu'); print('using:', 'cuda' if torch.cuda.is_available() else ('mps' if (getattr(torch.backends,'mps',None) is not None and torch.backends.mps.is_available()) else 'cpu'))"
- The following code provides a more systematic way to perform the same checks. In our lectures, some datasets are very large, so it's a good idea to make sure you have enough disk space. You can check your disk, memory, and CPU information using the code below.
import torch
def detect_torch_device(verbose: bool = True) -> str:
"""
Returns one of: 'cuda', 'mps', 'cpu'
Priority: CUDA GPU > Apple MPS > CPU
"""
has_cuda = torch.cuda.is_available()
has_mps = getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available()
if has_cuda:
device = "cuda"
elif has_mps:
device = "mps"
else:
device = "cpu"
if verbose:
print(f"torch: {torch.__version__}")
print(f"device: {device}")
if has_cuda:
print(f"cuda devices: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f" [{i}] {torch.cuda.get_device_name(i)}")
elif has_mps:
print("mps available: True (Apple Metal)")
else:
print("cpu only")
return device
device = detect_torch_device()
# generate 2x3 random matrix to check torch
x = torch.rand(2, 3)
x = x.to(device)
print("device:", x.device)
- Exercise System Info Cell
import os
import socket
import platform
import shutil
import datetime
def bytes_to_gb(x: int) -> float:
return x / (1024 ** 3)
def system_report(path: str = "."):
print("=== System Report ===")
print("OS:", platform.system(), platform.release())
print("Platform:", platform.platform())
now = datetime.datetime.now().astimezone()
print("Local time:", now.strftime("%Y-%m-%d %H:%M:%S %Z%z"))
print("Hostname :", socket.gethostname())
print("User :", os.getenv("USER") or os.getenv("USERNAME") or "unknown")
print("\n=== OS / Python ===")
print("OS :", platform.system(), platform.release())
print("Version :", platform.version())
print("Machine :", platform.machine())
print("Processor :", platform.processor() or "unknown")
print("Python :", platform.python_version())
# CPU
print("\n--- CPU ---")
print("CPU cores (logical):", os.cpu_count())
# Memory (best-effort, cross-platform)
print("\n--- Memory (RAM) ---")
try:
import psutil # you already have this in env
vm = psutil.virtual_memory()
print(f"Total: {bytes_to_gb(vm.total):.2f} GB")
print(f"Available: {bytes_to_gb(vm.available):.2f} GB")
print(f"Used: {bytes_to_gb(vm.used):.2f} GB ({vm.percent}%)")
except Exception as e:
print("psutil not available or failed:", e)
# Disk
print("\n--- Disk ---")
total, used, free = shutil.disk_usage(path)
print("Path checked:", os.path.abspath(path))
print(f"Total: {bytes_to_gb(total):.2f} GB")
print(f"Free: {bytes_to_gb(free):.2f} GB")
print(f"Used: {bytes_to_gb(used):.2f} GB")
# PyTorch device
print("\n--- PyTorch Device ---")
try:
import torch
cuda = torch.cuda.is_available()
mps = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
print("torch:", torch.__version__)
print("CUDA available:", cuda)
print("MPS available:", mps)
if cuda:
print("GPU:", torch.cuda.get_device_name(0))
device = "cuda" if cuda else ("mps" if mps else "cpu")
print("Suggested device:", device)
except Exception as e:
print("torch not available or failed:", e)
system_report(".")
1.1 Install ollama and download LLMs¶
We can download some popular LLMs such as Qwen series.
Step 1: Please download and install ollama from the official download page (https://ollama.com/download).
Step 2: After the installation of ollama, you can download qwen3:0.6b and qwen3:1.7b via:
ollama run qwen3:0.6b
It will automatically download the model to your local disk; it takes about 498MB of disk space. Similarly, you can run a larger version, qwen3:1.7b, which takes about 1.3GB of disk space:
ollama run qwen3:1.7b
The terminal looks like this:
- Step 3: Ask a question, like:
please tell me what is NLP?, and it gives us:
1.2 Get response for a given prompt¶
Call Ollama API
Next, we go a little deeper: we will call the Ollama API from Python to generate outputs given prompts. You can check whether Ollama is running by opening http://127.0.0.1:11434/ in your browser (e.g., Chrome).
import os
import time
import math
import requests
# If you use a proxy, make sure to bypass it for local services (e.g., Ollama on localhost:11434):
# export OLLAMA_HOST=http://127.0.0.1:11434
# export NO_PROXY=localhost,127.0.0.1
# export no_proxy=localhost,127.0.0.1
os.environ["OLLAMA_HOST"] = "http://127.0.0.1:11434"
os.environ["NO_PROXY"] = "localhost,127.0.0.1"
os.environ["no_proxy"] = "localhost,127.0.0.1"
s = requests.Session()
s.trust_env = False # IMPORTANT <- ignore ALL proxy env vars
print(s.get("http://127.0.0.1:11434/api/version", timeout=3).json())
# IMPORTANT: import ollama only after the proxy/env settings above are in place
import ollama
from ollama import chat
from ollama import ChatResponse
response: ChatResponse = chat(model='qwen3:0.6b', messages=[
{
'role': 'user',
'content': 'Fudan University is located in which city? Answer with one word.',
},
])
print(response['message']['content'])
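The ollama Python client above is a thin wrapper over a local REST API. As an alternative sketch (assuming an Ollama server is listening on 127.0.0.1:11434, as in the cells above), you can POST to /api/generate directly with only the standard library:

```python
import json
import urllib.request
import urllib.error

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "qwen3:0.6b",
    "prompt": "Fudan University is located in which city? Answer with one word.",
    "stream": False,  # return a single JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.load(resp)["response"])
except (urllib.error.URLError, OSError):
    print("Ollama server not reachable; start it with `ollama serve`.")
```

This is the same endpoint the client library calls under the hood; it is handy for debugging proxy problems because nothing here reads proxy environment variables implicitly.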
- Will the same prompt always produce the same output?
models = ["qwen3:0.6b", "qwen3:1.7b"]
prompt = "Fudan University is located in which city? Answer with one word."
for model in models:
print('-' * 50)
start_time = time.time()
for _ in range(10):
resp = ollama.generate(
model = model,
prompt = prompt
)
print(f"{model} with resp {_ + 1}: {resp['response']}")
print(f'total runtime of 10 responses of {model} is: {time.time() - start_time:.1f} seconds')
Some key observations:
- The smaller model, Qwen3:0.6B, tends to produce lower-quality answers, while the larger model, Qwen3:1.7B, generally produces higher-quality answers.
- The responses are somewhat random: running the same prompt multiple times may yield different outputs.
- There are ways to make the output deterministic. For example, the code below sets decoding options so that the model always selects the highest-probability token at each step.
resp = ollama.generate(
    model=model,
    prompt=prompt,
    options={
        "temperature": 0.0,
        "top_p": 1.0,
        "top_k": 0,
        # optional: "num_predict": 32,
    },
)
print(resp["response"])
Let us generate some responses that are not a single word but a sequence of words.
model = "qwen3:1.7b"
prompt = "I am an undergraduate student, please explain LLMs in three sentences."
resp = ollama.generate(
model=model,
prompt=prompt
)
print(f"Prompt: {prompt} \nResp: {resp['response']}")
1.3 Get response probability from LLMs¶
- Get token probability
model = "qwen3:0.6b" # "qwen3:1.7b"
prompt = "Fudan University is located in which city? Answer with one word."
num_top_tokens = 20 # number of top alternatives per generated token
resp = ollama.generate(
model = model,
prompt = prompt,
stream = False,
logprobs = True,
think = False,
top_logprobs = num_top_tokens
)
print("response:", repr(resp["response"]))
# Each element usually corresponds to one generated token
for i, lp in enumerate(resp.get("logprobs", [])):
tok = lp.get("token")
logp = lp.get("logprob")
p = math.exp(logp) if logp is not None else None
print(f"{i:02d} token={tok!r:>16} logp={logp: .4f} p={p:.4f}")
- Get specific token probability distribution
import os, math, ollama
model = "qwen3:0.6b"
prompt = "Fudan University is located in which city? Answer with one word."
res = ollama.generate(
model=model,
prompt=prompt,
logprobs=True,
think = False,
top_logprobs=10,
options={"temperature": 0.0,  # greedy decoding: always pick the highest-probability token
"num_predict": 20,
"think": False  # do not use thinking mode
},
)
answer = ''
lp = res["logprobs"]
tokens = [d.get("token", "") for d in lp]
print(f'We use model: {model}')
for i in range(len(lp)):
tok = lp[i].get("token", "")
logp = lp[i].get("logprob", None)
alts = lp[i].get("top_logprobs", [])
p = math.exp(logp) if logp is not None else float("nan")
if tok == "\n" or tok == "\n\n": # stop when answer ends (often newline).
break
answer += tok
print(f"--- top probabilities of token-{i:02d} ---")
for a in alts[:20]:
prob_a = math.exp(a['logprob'])
print(f"{a['token']!r:>12}:{prob_a:.5f}")
print(f"Partial Response: {answer}\n")
print(f"Final Response: {answer}")
- LLMs are probability distributions
From the outputs above, you can see that the response from these LLMs may differ from run to run. In fact, LLMs are probabilistic models: given the same prompt (i.e., question), they may produce different responses (i.e., answers). We can describe this inference process using mathematical notation.
Let $p_\theta(\cdot)$ denote a trained language model. Here, you can think of $p_\theta$ as Qwen3:0.6B, Qwen3:1.7B, or any other model. Given the following prompt
Prompt = Fudan University is located in which city? Answer with one word.
This prompt is a sequence of tokens. The model outputs a response, which is also a sequence of tokens. In other words, the model uses an inference procedure to predict the next token(s) conditioned on the prompt. In mathematical terms, it models the conditional probability
$$ p_\theta \left(w_{n+1} \mid w_1,w_2,\ldots,w_n\right), $$where
- Prompt = Fudan University is located in which city? Answer with one word. = $[w_1,w_2,\ldots,w_n]$,
- Response = $[w_{n+1}]$ (in this simplified one-word setting).
The model then outputs the most likely token $w_{n+1}$ (or samples from the distribution) using its own inference algorithm. We will discuss inference strategies in later lectures. By the definition of conditional probability, we have
$$ p_\theta \left(w_{n+1} \mid w_1,w_2,\ldots,w_n\right) = \frac{ p_\theta \left(w_1,w_2,\ldots,w_n,w_{n+1}\right) }{ p_\theta \left(w_1,w_2,\ldots,w_n\right) }. $$
The key takeaway is: if we can learn a model that assigns probabilities to sequences of arbitrary length, i.e., $p_\theta(w_1,w_2,\ldots,w_k)$ for any positive integer $k$, then conditional probabilities—and therefore next-token prediction—follow naturally from these joint probabilities.
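Conversely, the joint probability factorizes into next-token conditionals via the chain rule, so the joint-probability view and the next-token-prediction view are equivalent:

```latex
p_\theta(w_1,w_2,\ldots,w_k)
  = \prod_{i=1}^{k} p_\theta\left(w_i \mid w_1,\ldots,w_{i-1}\right),
\qquad \text{with the convention } p_\theta(w_1 \mid \cdot) := p_\theta(w_1).
```

This factorization is exactly what autoregressive language models implement: one conditional distribution per generated token.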
- Get token probabilities from Qwen3:1.7b
model = "qwen3:1.7b"
prompt = "Fudan University is located in which city? Answer with one word."
res = ollama.generate(
model=model,
prompt=prompt,
logprobs=True,
think = False,
top_logprobs=10,
options={"temperature": 0.0,  # greedy decoding: always pick the highest-probability token
"num_predict": 20,
"think": False  # do not use thinking mode
},
)
print(res["response"])
- If we do not use thinking mode, it gives the above response. However, we can still see how "Shanghai" is chosen during the inference stage.
import os, math, ollama
model = "qwen3:1.7b"
prompt = "Fudan University is located in which city? Answer with one word."
res = ollama.generate(
model=model,
prompt=prompt,
logprobs=True,
think = False, # Do not use the thinking model.
top_logprobs=10,
options={"temperature": 0.0,  # greedy decoding: always pick the highest-probability token
"num_predict": 20,
"think": False  # do not use thinking mode
},
)
answer = ''
lp = res["logprobs"]
tokens = [d.get("token", "") for d in lp]
print(f'We use model: {model}')
for i in range(len(lp)):
tok = lp[i].get("token", "")
logp = lp[i].get("logprob", None)
alts = lp[i].get("top_logprobs", [])
p = math.exp(logp) if logp is not None else float("nan")
if tok == "\n" or tok == "\n\n": # stop when answer ends (often newline).
break
answer += tok
print(f"--- top probabilities of token-{i:02d} ---")
for a in alts[:20]:
prob_a = math.exp(a['logprob'])
print(f"{a['token']!r:>12}:{prob_a:.5f}")
print(f"Partial Response: {answer}\n")
print(f"Final Response: {answer}")
2. Basics for Python and spaCy¶
Python: We will use Python-3.12 in our course.
spaCy: As introduced in https://github.com/explosion/spaCy, spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. We will use it to demonstrate how to do text tokenization.
nltk tool: In our previous courses, we introduced the nltk tool for tokenization. We will not cover it in this course, as these tools are now largely irrelevant and outdated. One can find details of nltk at https://github.com/nltk/nltk and its website at https://www.nltk.org/.
2.1 Python basics: regular expression¶
# Python has a built-in regular-expression library, re; we can import it
import re
- Disjunction []
# Task: Find woodchuck or Woodchuck : Disjunction
test_str = "This string contains Woodchuck and woodchuck."
result=re.search(pattern="[wW]oodchuck", string=test_str)
print("Matched" if result is not None else "Not found")
result=re.search(pattern=r"[wW]ooodchuck", string=test_str)
print("Matched" if result is not None else "Not found")
# Find the word "woodchuck" in the following test string
test_str = "interesting links to woodchucks ! and lemurs!"
result = re.search(pattern="woodchuck", string=test_str)
print("Matched" if result is not None else "Not found")
# Find !, it follows the same way:
test_str = "interesting links to woodchucks ! and lemurs!"
result = re.search(pattern="!", string=test_str)
print("Matched" if result is not None else "Not found")
result = re.search(pattern="!!", string=test_str)
print("Matched" if result is not None else "Not found")
assert re.search(pattern="!!", string=test_str) is None  # matches nothing
- Disjunction [0-9]
# Find any single digit in a string.
result=re.search(pattern=r"[0123456789]", string="plenty of 7 to 5")
print("Matched" if result is not None else "Not found",
result)
result=re.search(pattern=r"[0-9]", string="plenty of 7 to 5")
print("Matched" if result is not None else "Not found",
result)
- Negation: [^...]
# Negation: If the caret ^ is the first symbol after [,
# the resulting pattern is negated. For example, the pattern
# [^a] matches any single character (including special characters) except a.
# -- not an upper case letter
print(re.search(pattern=r"[^A-Z]", string="Oyfn pripetchik"))
# -- neither 'S' nor 's'
print(re.search(pattern=r"[^Ss]", string="I have no exquisite reason for't"))
# -- not a period
print(re.search(pattern=r"[^.]", string="our resident Djinn"))
# -- either 'e' or '^'
print(re.search(pattern=r"[e^]", string="look up ^ now"))
# -- the pattern ‘a^b’
print(re.search(pattern=r'a\^b', string=r'look up a^b now'))
- Operation: |
# More disjunctions
str1 = "Woodchucks is another name for groundhog!"
result = re.search(pattern="groundhog|woodchuck",string=str1)
print(result)
str1 = "Find all woodchuckk Woodchuck Groundhog groundhogxxx!"
result = re.findall(pattern="[gG]roundhog|[Ww]oodchuck",string=str1)
print(result)
- Operations: ?, *, +, ., $
# Some special chars
# ?: Optional previous char
str1 = "Find all color colour colouur colouuur colouyr"
result = re.findall(pattern="colou?r",string=str1)
print(result)
# *: 0 or more of previous char
str1 = "Find all color colour colouur colouuur colouyr"
result = re.findall(pattern="colou*r",string=str1)
print(result)
# +: 1 or more of previous char
str1 = "baa baaa baaaa baaaaa"
result = re.findall(pattern="baa+",string=str1)
print(result)
# .: any char
str1 = "begin begun begun beg3n"
result = re.findall(pattern="beg.n",string=str1)
print(result)
str1 = "The end."
result = re.findall(pattern=r"\.$",string=str1)
print(result)
str1 = "The end? The end. #t"
result = re.findall(pattern=r".$",string=str1)
print(result)
# find all "the" in a raw text.
text = "If two sequences in an alignment share a common ancestor, \
mismatches can be interpreted as point mutations and gaps as indels (that \
is, insertion or deletion mutations) introduced in one or both lineages in \
the time since they diverged from one another. In sequence alignments of \
proteins, the degree of similarity between amino acids occupying a \
particular position in the sequence can be interpreted as a rough \
measure of how conserved a particular region or sequence motif is \
among lineages. The absence of substitutions, or the presence of \
only very conservative substitutions (that is, the substitution of \
amino acids whose side chains have similar biochemical properties) in \
a particular region of the sequence, suggest [3] that this region has \
structural or functional importance. Although DNA and RNA nucleotide bases \
are more similar to each other than are amino acids, the conservation of \
base pairs can indicate a similar functional or structural role."
matches = re.findall("[^a-zA-Z][tT]he[^a-zA-Z]", text)
print(matches)
# A nicer way is to do the following
matches = re.findall(r"\b[tT]he\b", text)
print(matches)
import re
word_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
text = """Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 ,\
lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles \
III outside Japan , is a tactical role @-@ playing video game developed by Sega and \
Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , \
it is the third game in the Valkyria series . Employing the same fusion of tactical \
and real @-@ time gameplay as its predecessors"""
tokens = word_re.findall(text)
print(len(tokens))
print(tokens)
2.2 (Old way) Tokenization tool: spaCy¶
Tokenization via spaCy. We assume you have already installed spaCy. If you haven't installed it yet, open your terminal, activate your llm-26 env, and then run the following to download the English and Chinese models:
python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
There are two types of tokenization:
- Top-down tokenization: we define a standard and implement rules to carry out that kind of tokenization.
- word tokenization (spaCy, nltk)
- character tokenization (also can be done via spaCy)
- Bottom-up tokenization: we use simple statistics of letter sequences to break words into subword tokens.
- subword tokenization (modern LLMs use this type!)
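As a preview of bottom-up tokenization, here is a minimal, purely illustrative sketch of one byte-pair-encoding (BPE) merge step (hypothetical toy code, not the tokenizer of any particular LLM): count adjacent symbol pairs and merge the most frequent one.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(symbols, pair, new_symbol):
    """Replace every non-overlapping occurrence of `pair` with `new_symbol`
    (one BPE merge), scanning left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Classic example: in "aaabdaaabac" the pair ("a", "a") is most frequent,
# so we replace it with a fresh symbol "Z", giving "ZabdZabac".
data = list("aaabdaaabac")
pair = most_frequent_pair(data)   # ("a", "a")
data = merge_pair(data, pair, "Z")
print("".join(data))              # ZabdZabac
```

Real BPE repeats this merge step until a target vocabulary size is reached; we will study subword tokenization properly in a later lecture.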
- Top-down (rule-based) tokenization - word tokenization
- Simple tokenization via white space
# Use split method via the whitespace " "
text = """While the Unix command sequence just removed all the numbers and punctuation"""
print(text.split(" "))
# But, we have punctuations, icons, and many other small issues.
text = """Don't you love 🤗 Transformers? We sure do."""
print(text.split(" "))
- Whitespace cannot tokenize Chinese, Japanese, ...
text = '姚明进入总决赛'
print(text.split(" "))
- spaCy works much better
import spacy
nlp = spacy.load("zh_core_web_sm")
text = '姚明进入总决赛'
doc = nlp(text)
print([token for token in doc])
- Chinese Character level tokenization
from spacy.lang.zh import Chinese
nlp_ch = Chinese()
text = '姚明进入总决赛'
print([*nlp_ch(text)])
- English Character level tokenization
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
text = "Hello, world!"
chars = [c for c in text]
doc = Doc(nlp.vocab, words=chars)
print([t.text for t in doc])
# ignore white space
chars = [c for c in text if not c.isspace()]
doc = Doc(nlp.vocab, words=chars)
print([t.text for t in doc])
text = """Special characters and numbers will need to be kept in prices ($45.55) and dates (01/02/06);
we don’t want to segment that price into separate tokens of “45” and “55”. And there are URLs (https://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses (someone@cs.colorado.edu)."""
text = text.replace("\n", " ").strip()
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for token in doc:
print(token)
- Sentence-level tokenization
text = """自然语言处理(英语:Natural Language Processing,缩写作 NLP)是人工智能和语言学领域的交叉学科,\
研究计算机处理、理解与生成人类语言的技术。此领域探讨如何处理及运用自然语言;自然语言处理包括多方面和步骤,\
基本有认知、理解、生成等部分。自然语言认知和理解是让电脑把输入的语言变成结构化符号与语义关系,\
然后根据目的再处理。自然语言生成系统则是把计算机数据转化为自然语言。\
自然语言处理要研制表示语言能力和语言应用的模型, 建立计算框架来实现并完善语言模型,\
并根据语言模型设计各种实用系统及探讨这些系统的评测技术。"""
nlp = spacy.load("zh_core_web_sm")
doc = nlp(text)
print([sent.text for sent in doc.sents])
# You need to install it via: python -m spacy download zh_core_web_sm
from spacy.lang.zh.examples import sentences
nlp = spacy.load("zh_core_web_sm")
doc = nlp(sentences[0])
text = """\
字节对编码是一种简单的数据压缩形式,这种方法用数据中不存的一个字节表示最常出现的连续字节数据。\
这样的替换需要重建全部原始数据。字节对编码实例: 假设我们要编码数据 aaabdaaabac, 字节对“aa” \
出现次数最多,所以我们用数据中没有出现的字节“Z”替换“aa”得到替换表Z <- aa 数据转变为 ZabdZabac. \
在这个数据中,字节对“Za”出现的次数最多,我们用另外一个字节“Y”来替换它(这种情况下由于所有的“Z”都将被替换,\
所以也可以用“Z”来替换“Za”),得到替换表以及数据 Z <- aa, Y <- Za, YbdYbac.
"""
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
for ind, sent in enumerate(sentences):
print(f"sentence-{ind}: {sent}\n")
- spaCy handles special tokens
# spacy works much better
nlp = spacy.load('en_core_web_sm')
text = """Special characters and numbers will need to be kept in prices ($45.55) and dates (01/02/06);
we don’t want to segment that price into separate tokens of “45” and “55”. And there are URLs (https://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses (someone@cs.colorado.edu)."""
doc = nlp(text)
print([token for token in doc])
prompt = "Fudan University is located in Shanghai"
nlp = spacy.load("en_core_web_sm")
doc = nlp(prompt) # i want to do text preprocessing for our prompt.
# Text: The original word text.
# Lemma: The base form of the word.
# POS: The simple UPOS part-of-speech tag.
# Tag: The detailed part-of-speech tag.
# Dep: Syntactic dependency, i.e. the relation between tokens.
# Shape: The word shape – capitalization, punctuation, digits.
# is alpha: Is the token an alpha character? (whether it consists only of letters from the alphabet (A-Z or a-z))
# is stop: Is the token part of a stop list, i.e. the most common words of the language?
# (A stop list (or stopwords list) is a list of commonly used words in a language that
# are usually ignored during natural language processing (NLP) tasks, such as text analysis or machine learning.)
for token in doc:
print(f"--- token: {token.text} ---")
print(f"lemma: {token.lemma_}\npos: {token.pos_}\ntag: {token.tag_}\ndep: {token.dep_}\nshape: {token.shape_}\nis_alpha:{token.is_alpha}\nis_stop: {token.is_stop}")
- Lemmatization (词形还原)
Lemmatization is the task of determining that two words have the same root, despite their surface differences. For some NLP situations, we also want two morphologically different forms of a word to behave similarly. For example in web search, someone may type the string woodchucks but a useful system might want to also return pages that mention woodchuck with no s.
- Example 1: The words am, are, and is have the shared lemma be.
- Example 2: The words dinner and dinners both have the lemma dinner .
text = """
The Brown Corpus, a text corpus of American English that was compiled in the 1960s at Brown University, \
is widely used in the field of linguistics and natural language processing. It contains about 1 million \
words (or "tokens") across a diverse range of texts from 500 sources, categorized into 15 genres, such \
as news, editorial, and fiction, to provide a comprehensive resource for studying the English language. \
This corpus has been instrumental in the development and evaluation of various computational linguistics \
algorithms and tools.
"""
text = text.replace("\n", " ").strip()
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
lemmas = [token.lemma_ for token in doc]
for ori,lemma in zip(doc[:30], lemmas[:30]):
print(ori, lemma)
- Stemming (词干提取)
Lemmatization algorithms can be complex. For this reason we sometimes make use of a simpler but cruder method, which mainly consists of chopping off word-final affixes. This naive version of morphological analysis is called stemming. Note that spaCy does not provide a stemmer; the example below prints lemmas instead, for comparison.
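To make the idea concrete, here is a deliberately crude, hypothetical suffix-stripping function (not the real Porter stemmer, which applies ordered rules with extra conditions) just to illustrate what naive stemming does:

```python
def crude_stem(word: str) -> str:
    """Chop off a few common English suffixes. Very naive: no rule ordering
    beyond list order, and only a minimum-stem-length guard."""
    for suffix in ("ization", "ational", "ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["soundings", "heights", "crosses", "quickly", "written"]:
    print(w, "->", crude_stem(w))
```

Note the failure mode: "written" is left untouched (no listed suffix matches), which is exactly why lemmatization, with its dictionary knowledge (written → write), is often preferred.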
import spacy
nlp = spacy.load("en_core_web_sm")
text = """\
This was not the map we found in Billy Bones's chest, but \
an accurate copy, complete in all things-names and heights \
and soundings-with the single exception of the red crosses \
and the written notes.\
"""
doc = nlp(text)
for tok in doc:
if tok.is_alpha:
print(tok.text, tok.lemma_)
- English Sentence tokenization
# A modern and fast NLP library that includes support for sentence segmentation.
# spaCy uses a statistical model to predict sentence boundaries, which can be more accurate
# than rule-based approaches for complex texts.
# Install via conda: conda install conda-forge::spacy
# Install via pip: pip install -U spacy
# Download data: python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here is a sentence. Here is another one! And the last one.")
sentences = [sent.text for sent in doc.sents]
for ind, sent in enumerate(sentences):
print(f"sentence-{ind}: {sent}\n")
3. Datasets Exploration¶
3.1 Heaps' Law¶
Word types and word instances (tokens)
Word types are the distinct words in a corpus; if the set of words in the vocabulary is $$ V =\{v_1,v_2,\ldots,v_{|V|}\}, $$ then $|V|$ is the number of types. We also call $|V|$ the vocabulary size.
Word instances: given a text corpus, we treat it as a sequence of tokens $[w_1,w_2,\ldots,w_N]$ after tokenization. We call $N$ the total number of running words (tokens) in this corpus.
$N$: the number of word instances in the corpus
$|V|$: the number of word types
The larger the corpora we look at, the more word types we find. This relationship between $|V|$ and $N$ is called Herdan's Law or Heaps' Law after its discoverers (in linguistics and information retrieval, respectively). For positive constants $k$ and $\beta$ with $0<\beta<1$, it has the form
$$ |V|=k \cdot N^\beta, $$
where the value of $\beta$ depends on the corpus size and the genre. For English text corpora, typically $k\in [10,100]$ and $\beta \in [0.4, 0.6]$; for very large corpora, $\beta$ ranges from 0.67 to 0.75. Roughly, then, the vocabulary size of a text grows significantly faster than the square root of its length in words. Let us test it!
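Before running the full experiment, a tiny sketch of both quantities and the formula (toy corpus and illustrative constants, purely hypothetical): count types vs. tokens with a set, then evaluate $|V| = k N^{\beta}$.

```python
# Toy corpus: tokens (instances) vs. types (distinct words).
tokens = "the cat sat on the mat and the cat ran".split()
N = len(tokens)   # number of word instances (tokens)
V = set(tokens)   # set of word types
print("N =", N, "|V| =", len(V))  # N = 10 |V| = 7

# Heaps' law prediction with illustrative constants k=40, beta=0.5:
k, beta = 40, 0.5
for n in (10**4, 10**6, 10**8):
    print(f"N={n:>11,} -> predicted |V| ~ {k * n**beta:,.0f}")
```

With $\beta=0.5$ the vocabulary grows like $\sqrt{N}$ scaled by $k$; the fits below on real corpora recover $k$ and $\beta$ by linear regression in log-log space.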
3.2 Explore wikitext-2 dataset¶
- Please note that the Hugging Face Hub cannot be accessed directly from China. An alternative is to use the mirror https://hf-mirror.com/.
import os
# If this does not work, please add export HF_ENDPOINT=https://hf-mirror.com in your env.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["HF_HUB_ETAG_TIMEOUT"] = "60"
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"
from datasets import load_dataset
# loads train/validation/test
ds = load_dataset("wikitext", "wikitext-2-raw-v1")
print(ds)
train = ds["train"]
for i in range(4):
print(train[i])
- Simple tokenization and testing Heaps' Law.
import re
import numpy as np
import matplotlib.pyplot as plt
train = ds["train"]
start_time = time.time()
# simple word tokenizer (lowercased)
word_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
def heaps_curve(dataset, step=1000):
V = set()
N = 0
Ns, Vs = [], []
for ex in dataset:
text = ex["text"]
if not text or text.strip() == "": # empty word -> continue
continue
# make all words to low cases
words = word_re.findall(text.lower())
for w in words:
N += 1
V.add(w)
if N % step == 0:
Ns.append(N)
Vs.append(len(V))
return np.array(Ns), np.array(Vs)
Ns, Vs = heaps_curve(train, step=1000)
# Fit log |V| = log k + beta log N -> linear regression on logs
logN = np.log(Ns)
logV = np.log(Vs)
beta, logk = np.polyfit(logN, logV, 1)
k = np.exp(logk)
print(f"Fitted Heaps' law: |V| ≈ {k:.2f} * N^{beta:.3f}")
print(f"Total runtime: {time.time() - start_time:.3f} seconds")
# Plot (log-log)
plt.figure(figsize=(7, 5))
plt.loglog(Ns, Vs, marker='o', linestyle='none', markersize=5, label="Empirical (Wikitext-2)")
# fitted line
Ns_line = np.linspace(Ns[0], Ns[-1], 200)
Vs_line = k * (Ns_line ** beta)
plt.loglog(Ns_line, Vs_line, label=fr"Fitted ($k={k:.3f},\beta=${beta:.3f})")
plt.xlabel("$N$ (tokens)", fontsize = 15)
plt.ylabel("$|V|$ (vocab)", fontsize = 15)
plt.title(r"Heaps' Law (Simple tokenization): $|V|=kN^{\beta}$", fontsize = 15)
plt.legend(fontsize = 15)
plt.tight_layout()
plt.show()
- Use spaCy tokenization and test Heaps' Law
import time
import numpy as np
import matplotlib.pyplot as plt
import spacy
start_time = time.time()
nlp = spacy.load("en_core_web_sm", disable=["tagger","parser","ner","lemmatizer"])
# tokenizer still works even with pipeline disabled
train = ds["train"]
def heaps_curve_spacy(dataset, step=10_000, batch_size=256):
V = set()
N = 0
Ns, Vs = [], []
texts = (ex["text"] for ex in dataset if ex["text"] and ex["text"].strip())
for doc in nlp.pipe(texts, batch_size=batch_size):
for tok in doc:
# choose your definition of "word"
if tok.is_alpha:
w = tok.text.lower()
N += 1
V.add(w)
if N % step == 0:
Ns.append(N)
Vs.append(len(V))
return np.array(Ns), np.array(Vs)
Ns, Vs = heaps_curve_spacy(train, step=10_000)
# Fit log |V| = log k + beta log N
logN = np.log(Ns)
logV = np.log(Vs)
beta, logk = np.polyfit(logN, logV, 1)
k = np.exp(logk)
print(f"Fitted Heaps' law (spaCy): |V| ≈ {k:.2f} * N^{beta:.3f}")
print(f"Total runtime: {time.time() - start_time:.3f} seconds")
# Plot (log-log)
plt.figure(figsize=(7, 5))
plt.loglog(Ns, Vs, marker='o', linestyle='none', markersize=5, label="Empirical (Wikitext-2)")
# fitted line
Ns_line = np.linspace(Ns[0], Ns[-1], 200)
Vs_line = k * (Ns_line ** beta)
plt.loglog(Ns_line, Vs_line, label=fr"Fitted ($k={k:.3f},\beta=${beta:.3f})")
plt.xlabel("$N$ (tokens)", fontsize = 15)
plt.ylabel("$|V|$ (vocab)", fontsize = 15)
plt.title(r"Heaps' Law (spaCy tokenization): $|V|=kN^{\beta}$", fontsize = 15)
plt.legend(fontsize = 15)
plt.tight_layout()
plt.show()
3.3 Explore wikitext-103 dataset¶
We can download a larger dataset from the following:
export HF_HOME=$HOME/.cache/huggingface
huggingface-cli download Salesforce/wikitext --repo-type dataset --resume-download --include "wikitext-103-raw-v1/*.parquet"
from datasets import load_dataset
ds = load_dataset("wikitext", "wikitext-103-raw-v1") # will hit cache if present
print(ds)
train = ds["train"]
for i in range(4):
print(train[i])
start_time = time.time()
train = ds["train"]
# simple word tokenizer (lowercased)
word_re = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
def heaps_curve(dataset, step=1000):
V = set()
N = 0
Ns, Vs = [], []
for ex in dataset:
text = ex["text"]
if not text or text.strip() == "": # empty word -> continue
continue
# make all words to low cases
words = word_re.findall(text.lower())
for w in words:
N += 1
V.add(w)
if N % step == 0:
Ns.append(N)
Vs.append(len(V))
return np.array(Ns), np.array(Vs)
Ns, Vs = heaps_curve(train, step=1000)
# Fit log |V| = log k + beta log N -> linear regression on logs
logN = np.log(Ns)
logV = np.log(Vs)
beta, logk = np.polyfit(logN, logV, 1)
k = np.exp(logk)
print(f"Fitted Heaps' law: |V| ≈ {k:.2f} * N^{beta:.3f}")
print(f"Total runtime: {time.time() - start_time:.3f} seconds")
# Plot (log-log)
plt.figure(figsize=(7, 5))
plt.loglog(Ns, Vs, marker='o', linestyle='none', markersize=5, label="Empirical (Wikitext-103)")
# fitted line
Ns_line = np.linspace(Ns[0], Ns[-1], 200)
Vs_line = k * (Ns_line ** beta)
plt.loglog(Ns_line, Vs_line, label=fr"Fitted ($k={k:.3f},\beta=${beta:.3f})")
plt.xlabel("$N$ (tokens)", fontsize = 15)
plt.ylabel("$|V|$ (vocab)", fontsize = 15)
plt.title(r"Heaps' Law (Simple tokenization): $|V|=kN^{\beta}$", fontsize = 15)
plt.legend(fontsize = 15)
plt.tight_layout()
plt.show()